Evaluating the performance of multilingual models in answer extraction and question generation

Multiple-choice test generation is one of the most complex NLP problems, especially in languages other than English, where there is a lack of prior research. A review of the literature shows that earlier methods, such as rule-based systems and primitive neural networks, have given way to a more recent architecture, the Transformer, in the tasks of Answer Extraction (AE) and Question Generation (QG). This study therefore focuses on finding and developing better models for the AE and QG tasks in Spanish, using an answer-aware methodology. For this purpose, three multilingual models (mT5-base, mT0-base and BLOOMZ-560M) have been fine-tuned using three different datasets: a translation to Spanish of the SQuAD dataset; SQAC, which is a dataset in Spanish; and their union (SQuAD + SQAC), which shows slightly better results. Regarding the models, the performance of mT5-base has been compared with that of two newer models, mT0-base and BLOOMZ-560M. These models were fine-tuned for multiple tasks in the literature, including AE and QG, but, in general, the best results are obtained from the mT5 models trained in our study with the SQuAD + SQAC dataset. Nonetheless, some other good results are obtained from mT5 models trained only with the SQAC dataset. For their evaluation, the widely used BLEU1-4, METEOR and ROUGE-L metrics have been computed, where mT5 outperforms some similar research works. In addition, the CIDEr, SARI, GLEU, WER and cosine similarity metrics have been calculated to present a benchmark within the AE and QG problems for future work.


Motivation
The Question Generation (QG) task is one of the main Natural Language Processing (NLP) problems, and it is defined as the automatic generation of questions from inputs such as text, raw data and knowledge bases 1. Specifically, an answer-aware Question Generation system performs the QG task given a target answer (the right answer) as an input, in addition to the passage. Therefore, the task of Answer Extraction (AE) is essential for this system, and it is defined as the task of obtaining a fragment representing a plausible answer directly from the input passage, with no given question.
Answer-aware QG systems can be applied in several real applications such as the creation of open-domain chatbots 2, the improvement of Question Answering (QA) systems 3 or the creation of new automatic evaluation metrics 4. Additionally, QG contributes to the appearance of newer fact verification techniques 5 and to the automation of some educational tasks. Particularly, in the field of education, Le et al. 6 stated that it can be used for knowledge or skills acquisition (e.g., writing skills 7), as well as for knowledge assessment (e.g., multiple-choice questions 8,9 and the creation of tutorial dialogues 10).
Even though recent advances in NLP have achieved extraordinary results in QG systems, NLP applications are hugely language dependent. In this sense, research in languages other than English is very scarce, i.e., there are far fewer resources available. Particularly, in Spanish there are not enough native models and datasets to carry out more extensive research in the AE and QG fields with promising results. Consequently, in an attempt to overcome the dependency on English in the field of NLP, some multilingual models have been proposed. These models are pre-trained in several languages and are able to work reasonably well in languages other than English.
Regarding the Answer Extraction (AE) problem, most of the approaches obtain the answer given a passage and the corresponding question (Question Answering) instead of obtaining the answer directly from the text (AE). This methodology was first accomplished with rule-based methods 31 and more recently with Transformer-based neural networks 32,33. Even though the Question Answering technique is the most popular, other studies apply the idea of extracting answers directly from the passage, without a given question. This issue has been addressed by using a wide range of techniques. Firstly, linguistic tags and rules were used 34. Secondly, See et al. 35 proposed an attention-based pointer generator model, created as a hybrid between a pointer network 36 and a sequence-to-sequence model with attention (similar to the one proposed by Nallapati et al. 37). Then, Subramanian et al. 38 made use, exclusively, of pointer networks trained to point sequentially to the start and end locations of the answers. This approach was then improved by Sun et al. 39, using feature-enriched pointer generator models, by adding features such as named entity (NE) or part-of-speech (POS) in the embedding layer of the encoder. Later, a BERT 29 encoder was used to predict answer spans from the passage. Once several answers have been obtained, they are sorted according to a confidence score and the best answers are selected 40. In addition, the research done by Rodriguez-Torrealba et al. 41 uses fine-tuned T5 models to create multiple-choice questions following the answer-aware technique. Moreover, the use of the Stanford CoreNLP tagger 42 is another common methodology to address the AE task 43. Using the mentioned tagger, Arumae and Liu 44 applied summarization techniques to improve the performance of the AE process. Additionally, Dugan et al. 45 used a T5 language model 46 and proved that providing human-written summaries, or automatically generated ones (instead of the source passage), reinforces AE and QA tasks.
Finally, Uto et al. 47 have conducted a research study on the difficulty control of QG. For this purpose, a hybrid strategy using BERT 29 for AE and GPT-2 48 to perform answer-aware QG is proposed. Additionally, the difficulty control problem is addressed using item response theory (IRT) 49, making it possible to select the difficulty of the generated question-answer pairs.
The results of these studies must be captured with some metrics in order to compare the performance of different strategies. Numerous research studies make use of the BLEU 1-4 metrics 50, originally proposed to evaluate the quality of machine-translated text. Additionally, the METEOR metric 51 is also used, since it addresses some weaknesses of BLEU (e.g., by including recall). Moreover, ROUGE-L 52 is employed to measure the longest common subsequence between two sequences of words. As a result, recent studies have published benchmarks including these metrics 53. However, these metrics are not specific to the QG or AE problems. They are borrowed from other natural language generation (NLG) tasks, where they have been shown to correlate weakly with human opinion 54. Laban et al. 55 support this idea by using a new evaluation technique for QG in which a teacher selects a quiz concept, chooses which candidate questions from various models to include in the quiz and gives a reason to reject the others, concluding that better metrics are needed to guide future progress in QG and NLG research.

Materials and methods
This work follows an answer-aware QG technique. Since this technique requires the answer to be known beforehand, the answer is obtained in advance by following an AE methodology. Both the AE and QG tasks have been addressed with three different models (mT5-base, mT0-base and BLOOMZ-560M) by generating question-answer pairs. Finally, these pairs are evaluated using different metrics, which makes it possible to compare the performance of the models on the AE and QG problems.

Datasets
Regarding data, two datasets are used, SQAC and SQuAD. Additionally, a new one was created by joining the SQuAD and SQAC datasets. These datasets are explained below.
SQAC: the Spanish Question Answering Corpus (SQAC) is a set of 18,817 answerable questions mainly extracted from the Spanish Wikipedia and the Spanish Wikinews websites 56.
The dataset was created using an extractive method, so additional knowledge is not required to answer those questions, as they are based on their associated texts.
SQAC contains 15,036 examples in the training set, and its validation and test sets have been joined into a single validation set that contains a total of 3,774 rows.
SQuAD: the Stanford Question Answering Dataset (SQuAD) is a selection of questions based on a collection of articles obtained from the English Wikipedia 57.
It was created due to the need for a large and high-quality dataset. In this study, the first version (v1.1) is used, as every question has its answer, as opposed to the second version (v2.0), which includes more than 50,000 unanswerable questions.
Since our objective is to create Spanish-based models, we used an automatic translation of this dataset. It contains 87,595 examples in the training set and 10,570 in the validation set.
SQAC + SQuAD: this dataset is the union of the two previous ones. This combination was made to evaluate whether models trained with this data can generalize better. Likewise, we assess whether the generated predictions are written in better Spanish, as SQAC was created in this language from scratch. Finally, since deep learning methods need as much information as possible to train properly, this experiment is performed to analyze whether the improvements are substantial or not.
The resulting dataset is composed of 102,631 questions in the training set, of which 15,036 come from the SQAC dataset and 87,595 from SQuAD. In addition, their validation sets have been gathered into a 14,344-example validation set, so the performance of each model can be computed globally for the union of both datasets.
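The split sizes of the union follow directly from the figures reported above. As a quick sanity check, a minimal sketch (the dictionary layout and the function name `union_sizes` are illustrative assumptions, not part of the original pipeline) sums the per-split sizes the way a plain concatenation of the two datasets would:

```python
# Split sizes as reported in the text (number of question-answer examples).
SPLITS = {
    "SQAC": {"train": 15_036, "validation": 3_774},
    "SQuAD": {"train": 87_595, "validation": 10_570},
}


def union_sizes(splits):
    """Sum per-split sizes over all datasets, as a plain concatenation would."""
    totals = {}
    for sizes in splits.values():
        for split_name, n_examples in sizes.items():
            totals[split_name] = totals.get(split_name, 0) + n_examples
    return totals


union = union_sizes(SPLITS)
print(union)  # {'train': 102631, 'validation': 14344}
```
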

Models
We use two similar encoder-decoder models (mT5-base and mT0-base) and one decoder-only model (BLOOMZ-560M). All of them are multilingual models, pre-trained with several languages, including Spanish. These models were included in a pipeline that generates multiple-choice tests in Spanish, hence we look for different ways to improve the quality of questions and answers. Thus, our research study focuses on the comparison between different approaches.
Firstly, we used mT5 since text-to-text models are suitable for generative tasks. These models receive some text as an input and return another text as an output, which is, in our case, an answer (AE) or a question (QG).
mT5 is a multilingual variant of T5 pre-trained with mC4, a version of the C4 dataset with examples in 101 different languages 58. mT5 inherits the properties of T5, which has some implications for the inputs, since the task needs to be specified for the model to give an appropriate output 46. The architecture of mT5, along with mT0, is based on that of T5, specifically the enhanced T5.1.1 version. This updated version improves upon the original T5 by incorporating features such as GeGLU nonlinearities 59.
Once we had a benchmark to compare against, we also fine-tuned mT0 and BLOOMZ, which are multitask fine-tuned versions of the original mT5 and BLOOM models, respectively, optimized for enhanced cross-lingual and multitask capabilities. Notably, BLOOMZ adopts a decoder-only architecture and integrates modern enhancements, such as ALiBi positional embeddings 60 and an additional post-embedding normalization layer for better training stability. Due to the token limit that affects BLOOMZ, which can only process up to 2,048 tokens 61, and due to hardware limitations, the data preprocessing was adapted to tackle these issues.

Data preprocessing
Since two different problems are solved, a separate format is used for AE and QG. Nonetheless, the processing is common to both of them.
Due to the token limitation mentioned above, we needed to reduce the input length below 2,048 tokens for some contexts. However, because of further hardware limitations, the inputs were finally limited to 1,024 tokens for the mT0 and BLOOMZ models, and 512 tokens for the mT5 model. The context length, in addition to some relevant hyperparameters used during the training of the models, can be found in Table 1. While removing tokens, the part of the context related to the target (the desired output) could be deleted. To avoid this, we highlight either the phrase where the answer is (AE) or the answer from which the question will be generated (QG) with the tag <h1>, so that only parts of the context that do not contain that label are removed.
This technique is called context-aware answer extraction/question generation, and it helps to reduce the number of tokens. In addition, it makes models focus on a limited span of context. In this case, the mentioned span contains the targeted answer or a phrase to extract answers from. This provides more accurate results because the models do not focus on other possible occurrences of the answers throughout the input 62.
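The truncation idea can be illustrated with a minimal sketch. It is not the procedure from Appendix A: it assumes whitespace-separated tokens as a stand-in for the real subword tokenizer, and the function name `truncate_around_highlight` and the centring strategy are hypothetical choices made for illustration. The only property taken from the text is that the span marked with <h1> must survive the cut.

```python
TAG = "<h1>"


def truncate_around_highlight(text, max_tokens):
    """Keep at most max_tokens whitespace tokens, always including the <h1> span.

    Whitespace tokens are an assumption; a real pipeline would count
    subword tokens produced by the model tokenizer.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    tagged = [i for i, tok in enumerate(tokens) if TAG in tok]
    if not tagged:
        raise ValueError("the input has no highlighted span")
    start, end = tagged[0], tagged[-1] + 1  # token range covering the span
    spare = max_tokens - (end - start)      # budget left for surrounding context
    if spare < 0:
        raise ValueError("the highlighted span alone exceeds max_tokens")
    left = min(start, spare // 2)
    right = min(len(tokens) - end, spare - left)
    left = min(start, spare - right)        # reuse budget unused on the right
    return " ".join(tokens[start - left:end + right])


words = [f"w{i}" for i in range(50)]
words.insert(20, TAG)
words.insert(23, TAG)
print(truncate_around_highlight(" ".join(words), max_tokens=10))
```

Centring the window on the highlighted span is only one option; a sentence-aligned cut would be closer in spirit to the sentence-level highlighting used for AE.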
Please note that a similar data preprocessing step must be carried out before inference.
1. For the AE task, a context with n sentences will generate n inputs for the AE model. In each of those inputs, a different sentence must be highlighted with the <h1> tag, and this must be disjoint from the highlighted sentences in other inputs.
2. Regarding the QG task, every input must have exactly one answer highlighted within the context, using the <h1> tag. The answer may be extracted and labelled manually, or alternatively, an AE model may be employed to retrieve an answer and automatically label it within the context.
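The first step above (one input per sentence) can be sketched as follows. This is a minimal illustration under stated assumptions: `split_sentences` is a naive regular-expression splitter used as a stand-in for a proper sentence tokenizer, the function names are hypothetical, and no task prompt is added because the AE prompt wording is not reproduced here.

```python
import re


def split_sentences(text):
    """Naive stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def highlight_each_sentence(context, tag="<h1>"):
    """Return one copy of the context per sentence, each with a different
    sentence wrapped in the highlight tag."""
    sentences = split_sentences(context)
    inputs = []
    for i in range(len(sentences)):
        parts = [f"{tag} {s} {tag}" if j == i else s
                 for j, s in enumerate(sentences)]
        inputs.append(" ".join(parts))
    return inputs


context = "Javier is 10 years old and his brother is 15. Javier has curly hair."
ae_inputs = highlight_each_sentence(context)
for text in ae_inputs:
    print(text)
```

Each generated input would then be passed to the AE model separately, so a context with n sentences costs n forward passes at inference time.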
Regarding the tokenizer, it was trained with the special token <sep>, which lets the model identify different important fragments. For example, this token can separate the end of an answer from the beginning of the next one, or split the information given to the model (the passage) from the command (extract answers).
AE format: first, a feature selection is performed to obtain the title, context, and answers from each row. Then, for each context in the dataset, its answers are gathered. To group the different answers that belong to the same context, the title is used, which is discarded after this step.
With this procedure, duplicated contexts are removed. Next, every phrase of the passage, obtained by sent_tokenize 63, is delimited, so the answers can be grouped per phrase using their answer_start attribute.
Table 1. Hyperparameters used to train the mT5, mT0 and BLOOMZ models, respectively. The mT5 model was trained with different hyperparameters for its QG and AE versions due to hardware limitations, while mT0 and BLOOMZ were trained using the same hyperparameters in both cases.

sent_tokenize retrieves the n sentences within a text. For each of those sentences, a duplicate of the original context is created, in which its corresponding sentence is highlighted with the <h1> tag. Consequently, n possible inputs are obtained. Then, once the answers have been gathered by the sentence they belong to, they are formatted as follows: , where every answer belongs to the same sentence. Finally, input-target pairs (registers) are created, where the input is a context with a highlighted sentence and its target is the set of related answers, so the texts without a target are discarded. Last, we format the inputs to indicate the task the models need to perform. As mT5 and mT0 are similar models, we used the same prompt for both. For BLOOMZ, we used a prompt like the suggested templates in the work of Muennighoff et al. 61. The resulting inputs can be better seen in Figs. 1 and 2.
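The grouping of answers per sentence can be sketched as below. Several parts are assumptions rather than the authors' implementation: the regular expression is a naive stand-in for sent_tokenize, the target string joins the answers with <sep> (the exact target format is not reproduced in the text, although <sep> is described as the separator between consecutive answers), the task prompt of Figs. 1 and 2 is omitted, and the names `build_ae_examples` and `SENT_RE` are hypothetical.

```python
import re

TAG = "<h1>"
SEP = "<sep>"
# Naive sentence matcher (assumption): text up to '.', '!' or '?' followed by
# whitespace or the end of the string.
SENT_RE = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", re.S)


def build_ae_examples(context, answers):
    """Create one input-target pair per sentence that contains answers.

    answers: list of dicts with "text" and "answer_start" (character offset).
    """
    examples = []
    for match in SENT_RE.finditer(context):
        start, end = match.start(), match.end()
        in_sentence = [a["text"] for a in answers
                       if start <= a["answer_start"] < end]
        if not in_sentence:
            continue  # sentences without a target are discarded
        highlighted = (context[:start] + f"{TAG} {match.group()} {TAG}"
                       + context[end:])
        # Assumed target format: answers of one sentence joined by <sep>.
        examples.append({"input": highlighted,
                         "target": f" {SEP} ".join(in_sentence)})
    return examples


context = "Javier is 10 years old and his brother is 15. Javier has curly hair."
answers = [
    {"text": "10 years old", "answer_start": context.index("10 years old")},
    {"text": "15", "answer_start": context.index("15")},
]
ae_examples = build_ae_examples(context, answers)
print(ae_examples)
```

In this toy example the second sentence has no associated answer, so only one input-target pair is produced.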
QG format: unlike the previous task, in the Question Generation problem only one question per answer is obtained, so removing duplicated contexts is not required.
To reduce the dimensionality of the problem, the context, the targeted question, and the answer to that question are selected. The answer is then highlighted in the context using the <h1> tag, as previously mentioned. In the case of the mT5 model, the answer is discarded, as it is no longer needed, and a simple prompt is added, as illustrated in Fig. 5. Conversely, the mT0 and BLOOMZ models also require the answer for the prompt creation (Figs. 3 and 4), and thus, it is dropped after the prompt is generated and coupled to the rest of the input. In these cases, the template changes from the previous task, and it is shown in Figs. 3, 4 and 5 for the mT0, BLOOMZ and mT5 models, respectively. Please note that, as for the AE task, a prompt similar to those suggested by Muennighoff et al. 61 for the BLOOMZ and mT0 models is used.
The algorithms used for both AE and QG can be found in Appendix A of the Supplementary Information.
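As an illustration of the mT0-style QG format, the sketch below highlights the answer and builds an input-target pair. It rests on assumptions: the prompt wording follows the English translation given in the figure captions rather than the Spanish prompt actually used, the leading <s> token shown in the figures is omitted for simplicity, and the function names are hypothetical.

```python
TAG = "<h1>"
SEP = "<sep>"


def highlight_answer(context, answer_text, answer_start):
    """Wrap the answer span in the highlight tag, checking the offset first."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at the answer text")
    return f"{context[:answer_start]}{TAG} {answer_text} {TAG}{context[end:]}"


def build_qg_example(context, question, answer_text, answer_start):
    """Build an encoder-decoder style pair: prompt as input, question as target."""
    highlighted = highlight_answer(context, answer_text, answer_start)
    prompt = (f'Given the following context "{highlighted}", {SEP} '
              f'generate a question whose answer would be: "{answer_text}".')
    return {"input": prompt, "target": question}


context = "Javier is 10 years old and his brother is 15. Javier has curly hair."
qg_example = build_qg_example(
    context,
    question="How old is Javier?",
    answer_text="10 years old",
    answer_start=context.index("10 years old"),
)
print(qg_example["input"])
print(qg_example["target"])
```

For a decoder-only model such as BLOOMZ, the figures suggest that the question is appended to the prompt in a single sequence instead of being kept as a separate target.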

Evaluation metrics
Once the different models have been trained, inference is done with the validation set of each dataset, obtaining some predictions to compute the different metrics. These metrics have been applied to determine which strategy provides better results for the AE and QG problems.
The BLEU 1-4 50, METEOR 51, ROUGE-L 52 and CIDEr 64 metrics are calculated using the nlg-eval tool 65. In addition, the cosine similarity obtained from the vectorial representation of the text 66 has been calculated using the sentence_similarity_spanish_es model. Finally, the SARI 67, GLEU 68 and WER 69,70 metrics have been obtained to study their performance in the AE and QG problems as discriminators between models, all of them computed using Hugging Face's Evaluate library.
Figure 3. In this case, mT0 has a similar format to BLOOMZ. Nonetheless, as mT0 is an encoder-decoder model, the target is separated from the input. The translation into English is the following: Raw Input: "context": "Javier is 10 years old and his brother is 15. Javier has curly hair." "question": "How old is Javier?" "answers": {"text": "10 years old", "answer_start": 13}. Figure's main body: <s> Given the following context "Javier is <h1> 10 years old <h1> and his brother is 15. Javier has curly hair.", <sep> generate a question whose answer would be: "10 years old". Input: <s> Given the following context "Javier is <h1> 10 years old <h1> and his brother is 15. Javier has curly hair.", <sep> generate a question whose answer would be: "10 years old". Target: "How old is Javier?".

• BLEU: this metric was proposed for evaluating machine-translated text. It compares the n-grams that overlap between the original sentence and the predicted one. In this study, it is used to evaluate the quality of the generated questions and answers, where it is expected that the more similar they are to the reference, the greater their fidelity to the original text, which is crucial for multiple-choice test generation.
• METEOR: this metric is used to evaluate results in machine translation problems. It performs unigram alignments between the reference and predictions, which are based on exact, stem, synonymy and paraphrase matches 65. It solves some issues that BLEU presents (e.g., lack of recall, the use of higher order n-grams and the geometric averaging of those n-grams), so it could be a better metric to evaluate whether two questions preserve their meaning despite being paraphrased.
• ROUGE-L: it extracts the longest common subsequence between a given reference and the obtained prediction. Since the generated questions may differ from the original ones but can preserve their original meaning, this metric is not a good discriminator for QG models. On the contrary, in the AE problem, the main objective is to extract answers directly from the text, so rephrasing or using synonyms is not entirely correct, as the original meaning may be distorted. Thus, a higher ROUGE-L value in AE means that predictions are faithful to the original answers.
• CIDEr: this metric was proposed for evaluating the quality of image descriptions. In a similar way to the metrics explained above, it compares n-grams among the several descriptions given as inputs, using a stemmed representation of the text. The most frequent n-grams are given a lower weight, as it is understood that they are likely to be less informative. Thus, CIDEr could be a good metric to evaluate whether the questions are meaningful or not.
• Cosine similarity: this metric is computed using a vectorial representation of the text. With the obtained vectors, the cosine of the angle between them can be calculated, which provides an intuition of the similarity of the two texts. Please note that similar (related in meaning) texts have values close to one (1), whilst identical texts present a value of one (1). Similarly, the values of unrelated texts are close to zero (0), and orthogonal vectors have a value of zero (0).
• SARI: this metric compares the System output Against References and the Input sentence. It is computed by calculating the precision and recall for word additions, rewarding those that are found in the references and penalizing the ones that are not. Similarly, the precision and recall are calculated for those words kept in the output, and the precision for the deleted ones. In a last step, the arithmetic average is calculated, using the F1-score of the added and kept words and the precision of deletions. Xu et al. 67 state that SARI is more correlated with simplicity than meaning, where other metrics have a better performance (e.g., BLEU). Considering both metrics (SARI and BLEU), two different models could be compared to analyze which of them generates questions that better preserve their original meanings, and then choose the model which provides simpler and more meaningful questions.
• GLEU: the GLEU score is a metric proposed by Wu et al. 68 and is a slightly modified version of the BLEU metric. According to the study, since BLEU was designed as a corpus measure, it behaves poorly when evaluating individual sentences. Hence, GLEU tries to solve this undesirable property of BLEU. The procedure is the following: first, it extracts n-grams of 1, 2, 3 or 4 tokens from the target and output sequences. Then, recall and precision are calculated and, finally, the minimum between them is chosen. Note that this metric is bounded between 0 (no matches) and 1 (exact match). In this study, since metrics are computed by comparing sentences individually, GLEU may provide better results than BLEU.
• WER: Word Error Rate is a commonly used metric in speech recognition problems. It considers the number of substitutions, deletions, insertions, and correct words in a predicted text, given a reference. In this study, for the AE and QG problems, WER is used to evaluate how each prediction differs from its original reference. One problem must be taken into consideration: deletions and insertions are not symmetric. This issue may make WER misbehave, giving values greater than 1 69.
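To make three of the definitions above concrete, the following is a minimal sketch of sentence-level GLEU, WER and cosine similarity. It is not the code used to produce the reported results (which rely on nlg-eval, a sentence-embedding model and the Evaluate library): tokens are obtained here by whitespace splitting, the cosine function takes plain numeric vectors instead of model embeddings, and all function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(reference, prediction, max_n=4):
    """Minimum of n-gram precision and recall over 1..max_n-grams."""
    ref = ngram_counts(reference.split(), max_n)
    hyp = ngram_counts(prediction.split(), max_n)
    total_ref, total_hyp = sum(ref.values()), sum(hyp.values())
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    matches = sum((ref & hyp).values())  # multiset intersection
    return min(matches / total_ref, matches / total_hyp)


def wer(reference, prediction):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


print(sentence_gleu("how old is javier", "how old is he"))
print(wer("yes", "yes it is indeed"))  # 3 insertions over 1 reference word: 3.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The second call shows why WER is not bounded by 1: a prediction much longer than its reference accumulates more insertions than there are reference words.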

Results
Using the evaluation metrics, several results have been obtained. Tables 2, 3 and 4 show the performance of the three models, fine-tuned with different datasets, on the AE task, and Tables 5, 6 and 7 on the QG task. Tables 8 and 9 present the performance of the mT0 and BLOOMZ base models on these tasks for the different datasets. Each table has three columns for each model: the first one shows the results obtained for the SQuAD validation set, the second one for the SQAC validation set and the third one for the SQuAD and SQAC validation sets combined. This last column has been calculated to show the performance of the model in a more general context, with different passages and question structures from both datasets. However, it must be taken into consideration that the two validation sets differ in size: the SQuAD validation set contains more records (10,570) than the SQAC one (3,774).
Regarding the results of the models fine-tuned with SQuAD for Answer Extraction shown in Table 2, mT5 outperforms the other ones in all the evaluation metrics except in SARI, where BLOOMZ achieves better results for the SQuAD and SQAC + SQuAD datasets.
According to the results shown in Table 3 about the models fine-tuned with SQAC for Answer Extraction, mT5 achieves better results for each metric except for WER, cosine similarity and SARI. Indeed, BLOOMZ

Figure 4. Format for the BLOOMZ model for the Question Generation task. The translation into English is the following: Raw Input: "context": "Javier is 10 years old and his brother is 15. Javier has curly hair." "question": "How old is Javier?" "answers": {"text": "10 years old", "answer_start": "13"}. Figure's main body: <s> Given the following context "Javier is <h1> 10 years old <h1> and his brother is 15. Javier has curly hair.", <sep> generate a question whose answer would be: "10 years old" Question: How old is Javier?</s>. Input: <s> Given the following context "Javier is <h1> 10 years old <h1> and his brother is 15. Javier has curly hair.", <sep> generate a question whose answer would be: "10 years old" Question: How old is Javier?</s>.

Concerning the results about the models fine-tuned with SQAC + SQuAD for Answer Extraction, shown in Table 4, mT5 outperforms the other models in all metrics except SARI, where BLOOMZ achieves the best results.
Analyzing the scores of the first three tables, the models fine-tuned with the union of SQAC and SQuAD datasets achieve, in general, better results for the evaluation metrics computed.
Focusing on the performance of the three models fine-tuned with SQuAD in Question Generation (Table 5), mT5 achieves the best results for all the metrics except for SARI and WER. The best results for the SARI metric are achieved by mT0 for SQAC and by BLOOMZ for SQuAD and SQAC + SQuAD. The best results for the WER metric are obtained by mT0 for SQuAD, by BLOOMZ for SQAC and by mT5 for SQAC + SQuAD.
According to the results obtained by models fine-tuned with SQAC for Question Generation (Table 6), mT5 achieves the best results except for WER and SARI. Specifically, BLOOMZ obtains better results for the SARI metric and mT0 obtains the best cosine similarity and WER on SQuAD.
Regarding the models fine-tuned with SQAC and SQuAD for Question Generation (Table 7), mT5 also outperforms the other models in all metrics except SARI, where BLOOMZ achieves better results.
Analyzing the results obtained (Tables 5, 6, 7), the models fine-tuned with SQAC + SQuAD achieve, in general, better results for the evaluation metrics computed.
Finally, regarding the results obtained from the base models (not fine-tuned), applying a zero-shot approach, for the tasks of AE (Table 8) and QG (Table 9), mT0 achieves better results in both tasks except for the WER and SARI metrics, where BLOOMZ obtains the best results. Specifically, BLOOMZ achieves better results for SARI in all datasets in both tasks, and for WER in all datasets for AE and in SQuAD and SQAC + SQuAD for QG.
Once all the evaluation metrics are computed, some results must be highlighted.

A. mT5 models
Regardless of the dataset used, mT5 models achieve, in general, the best results for every evaluation metric except SARI, where the other models sometimes achieve better results. As an example, the best results according to the BLEU4 metric for QG and AE are obtained with these models.
Focusing on the values obtained, the best mT5 model is the one fine-tuned with the union of the SQAC and SQuAD datasets, with very little difference from the one fine-tuned only with SQuAD. Even though the results obtained with the union dataset are the best ones in general, it is important to point out that some of the best results are obtained from the model fine-tuned only with SQAC (e.g., BLEU4 for AE).

B. mT0 models
Although they are not the best ones, the results show that the scores of the mT0 models are close to those of mT5. In a similar way to mT5, the values obtained show a better performance of the model fine-tuned with the union dataset, followed by the one fine-tuned with SQuAD. However, the model fine-tuned with SQAC achieved the highest cosine similarity of the research study for the SQuAD dataset. Another important fact is that, focusing only on the results obtained by the mT0 models, the one fine-tuned with SQAC achieves the best results for the BLEU1-4 and METEOR metrics in the AE task. Additionally, the one fine-tuned with the union dataset obtains the best results for the QG task.

C. BLOOMZ models
Even though these models provide the worst results in our research study, there is an important detail that must be mentioned. Focusing on the results, a huge difference exists between the ones obtained from the model fine-tuned with SQuAD and the ones from the model fine-tuned with the union dataset. Please note that the first ones are the best for BLOOMZ models, while the others are the worst.

D. Base models
According to the results shown, the zero-shot technique with mT0 is more effective for the AE and QG tasks. However, the performance of the base models is lower than that of the previous models, which are fine-tuned for the tasks at hand, as expected 61.

Discussion and conclusion
Answer Extraction and Question generation are ones of the most difficult NLP tasks 71 .Furthermore, these problems become even more challenging when languages other than English are used to address them.As a result, in other languages, such as Spanish, there is a huge lack of research about these problems.The appearance of multilingual models and new Spanish datasets is helping the development on NLP research regarding AE and QG.
In this study, we have taken the research made by Ricardo-Torrealba 41 and Chan and Fan 17 as baselines for our work.Following their studies, we have applied them to Spanish by fine-tuning three multilingual models (mT5base, mT0-base and BLOOMZ-560 M) with three different datasets in Spanish (SQAC, SQuAD, SQAC + SQuAD).Computing a wide range of metrics, their performance has been evaluated on the AE and QG tasks.
Our results demonstrate that, following an answer-aware QG technique with the mT5 model finetuned with SQAC + SQuAD, it is possible to outperform certain prior research for QG in English.For example, regarding the results provided in 20,25,26 in QG for BLEU4 and METEOR on the SQuAD dataset, there is a noticeable improvement.Moreover, our approach achieves slightly better results for BLEU4 and METEOR than the ones obtained by Liu et al. 23 .Additionally, we achieved better results for BLEU2-4 than the work proposed by Sun et al. 39 and improved BLEU4 results for the SQuAD dataset in 28 .Finally, comparing the results obtained by Wang et al. 18 in QG for BLEU3-4 and METEOR on the whole SQuAD dataset, we achieve some improvement.In addition, our study outperforms significantly the results presented by Ushio et al. 53 for the BLEU4, ROUGE-L and METEOR metrics obtained for QG in Spanish using mT5.This study, which is the only one that provides results for Spanish QG among all the related research found, evaluates different language models on QG-Bench dataset, a unified collection of datasets with the same format in which SQuAD 57 is included.The improvements achieved by our approach compared to all these studies are because of two main reasons: (1) effectiveness of data preprocessing that avoid raising the complexity of the system.Unlike other strategies that apply complex techniques such as masking the answer from the passage 25 , matching strategies between passage and answers 20,26 or hybrid answerfocused and position-aware models 39 , our approach is able to stand out the relationship between the passage and its associated answer without increasing the complexity of the system.Thanks to this preprocessing technique, the increase of complexity in the pipeline is avoided, making easier the understanding of this relationship to the model.The second reason is (2) the potential performance of the multilingual models used.Even though in this study it was not possible to 
use bigger models due to hardware limitations, the architectures of these models and their huge pretraining make it possible to outperform more sophisticated architectures such as the ones proposed by S. Wang et al. 18 , which uses knowledge graphs and divides the task of question generation into two steps, query representation and query-based question generation, or Liu et al. 23 , which builds the questions by identifying where each word in the question should came from a vocabulary or copied from the input text.
However, we are not able to outperform all the results found in the literature. Compared to our baselines, we almost reach the BLEU4 (21.32), METEOR (27.09) and ROUGE-L (43.59) values obtained for English QG by Ricardo-Torrealba et al. 41, who use the T5 fine-tuned models by Patil 72. Moreover, our results are close to those obtained by Chan and Fan 17 using BERT for English QG in BLEU4 (20.33), METEOR (23.88) and ROUGE-L (48.23). Furthermore, the approach proposed by Sasazawa et al. 24 clearly outperforms our results by incorporating an interrogative phrase at the end of the passage. Also, the technique of selecting the best answer using a confidence score for the AE task and the QG technique proposed by Back et al. 40 outperform our results. Finally, the model proposed by Murakhovs'ka et al. 22, trained on a total of nine English datasets, obtained the best results of all previous research without using the interrogative phrases technique 24, which underlines the importance of the amount of data used for training. The advantage of these works over our results may be due, to some extent, to the use of the English language, which has more resources and prior research than Spanish (e.g. Murakhovs'ka et al. 22 used nine available datasets for QA in English). Nevertheless, approaches such as the one followed by Sasazawa et al. 24, which uses a different input format with interrogative phrases, might make the question generation task easier for the model, leading to better results. Additionally, the idea proposed by Back et al. 40 of obtaining several candidate answers, ranking them and choosing the best ones tolerates a certain margin of error during generation, which is later corrected when the best generated answer is selected, thus improving the results. Table 10 compares our results to those achieved by the studies that define our baseline and other related work.
Once the results are analyzed, five important facts can be highlighted. (1) The importance of high-quality Spanish datasets. (2) The importance of the amount of data used for training. (3) The significance of the fine-tuning process for AE and QG in multitask models. (4) The bias of the dataset towards a specific task. (5) The influence of multitask fine-tuning on future use. These ideas are explained below.
Firstly, it has been shown that a high-quality dataset such as SQAC, built directly in Spanish without automatic translation, can significantly improve the performance of the models. Furthermore, it achieves this improvement with a very small amount of data compared to the SQuAD dataset. As a result, there is an important need to build high-quality datasets directly in Spanish, without automatic translation, to improve the performance of generative models.
Secondly, the results have shown that the greater the availability of data, the better the behavior of the models. In general, the models trained with the combination of SQAC and SQuAD provide better performance. However, this does not mean that it is better to create huge datasets while neglecting the quality of the data. Indeed, the real objective should be the creation of bigger Spanish datasets with quality data focused on the AE and QG tasks.
Thirdly, the results obtained with the base models are significantly worse than those obtained after fine-tuning. Even though the mT0 and BLOOMZ models were pretrained for multiple tasks and the same format was followed, it has been shown that more task-specific training is needed to achieve better results. However, even once fine-tuned, these models achieve worse results than mT5, which is not a multitask model. Therefore, for the QG and AE tasks, the multitask training carried out seems to be insufficient. Note that the size of the model plays an important role here, so the zero-shot performance of bigger models is expected to be better.
Furthermore, when analyzing the results obtained, the SQuAD dataset provides more balanced metrics between the AE and QG tasks, while the SQAC results show higher correctness in the AE task. Consequently, it can be deduced that the nature of the dataset plays a very important role in the performance of each task. On the one hand, SQAC has a longer average passage length, which seems to improve the model's understanding, so that the answers are obtained more precisely. On the other hand, SQuAD contains shorter questions than SQAC, which makes it easier to obtain better metrics. As a result, longer passages improve the performance on AE. However, the performance of QG is biased by the length of the question: since the metrics compare n-grams, the longer the question is, the more difficult it is to reproduce it.
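The effect of n-gram comparison mentioned above can be illustrated with a minimal sketch of clipped n-gram precision, the quantity on which BLEU-style metrics are built. This is not the evaluation code used in this study: the function names, the whitespace tokenization and the example sentences are assumptions made only for illustration, and the brevity penalty and the geometric mean of the full BLEU score are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) contained in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference.

    Hypothetical helper: whitespace tokenization, one reference only.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


reference = "how old is javier"
candidate = "how old is the boy"
for n in range(1, 5):
    print(n, clipped_precision(candidate, reference, n))
```

A single differing word removes up to n of the matching n-grams, so the precision drops faster for higher n; a longer question offers more positions at which such a mismatch can occur.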
Finally, as shown with the BLOOMZ model, the results obtained are highly biased by the fine-tuning given to this model. Even though BLOOMZ is a multilingual model, it has been mostly pretrained with English datasets, such as SQuAD, for the tasks of multiple choice and extractive question answering. Hence, the results for Spanish datasets that were not used during the pretraining phase (e.g., SQAC) are very poor. However, better results are obtained for the Spanish version of SQuAD, since the English version was used in the pretraining process.
In conclusion, we have evaluated and compared, through automatic evaluation metrics, the performance of three multilingual models in the AE and QG tasks, using three different Question Answering datasets. Our results show that the best approach to solve the AE and QG problems in Spanish is the mT5 model fine-tuned with the union of SQAC and SQuAD, which is even capable of outperforming some of the BLEU and METEOR results found in previous research in English. However, although we improve on the results of some previous studies, other prior studies in English achieve better performance, which highlights the lack of research for NLP in Spanish. Regarding the ROUGE-L metric, our models obtain worse results but, as explained before, it is not a good discriminator for QG. Indeed, it is better suited to evaluating AE performance, where we obtained better results for this metric. However, we found no literature with which to compare our results for this task. Consequently, we encourage future research to evaluate AE performance. Additionally, we propose the results of this study, including less common but more adequate metrics such as cosine similarity, as a benchmark for future work within the AE and QG problems in Spanish.

Figure 1. Format for the BLOOMZ model for the Answer Extraction task. The <s> token marks the beginning of the input, then the first part of the prompt is given. The part between quotation marks is the context, and the text enclosed between the <h1> tags is the span on which the model must focus. Then, the second part of the prompt is given, after the <sep> token, which indicates the beginning of the target. </s> means end of sentence. The translation into English is the following: Raw Input: "context": "<h1> Javier is 10 years old and his brother is 15. <h1> Javier has curly hair." "answers": ["10 years old", "15"]. Figure's main body: <s> Given the following context "<h1> Javier is 10 years old and his brother is 15. <h1> Javier has curly hair.", <sep> extract answers: "10 years old <sep> 15 <sep>" </s>. Input: <s> Given the following context "<h1> Javier is 10 years old and his brother is 15. <h1> Javier has curly hair.", <sep> extract answers: "10 years old <sep> 15 <sep>" </s>.
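The format described in the Figure 1 caption can be sketched as a small string-building function. This is only an illustration based on the English translation given in the caption: the function name, its parameters and the sentence-level splitting of the context are assumptions, and the prompts actually used for training are written in Spanish.

```python
def build_bloomz_ae_example(sentences, focus_index, answers):
    """Build one decoder-only training string in the style of Figure 1.

    Hypothetical helper: `sentences` is the context already split into
    sentences, `focus_index` selects the sentence wrapped in <h1> tags,
    and `answers` lists the answer spans to be extracted from it.
    """
    parts = []
    for i, sentence in enumerate(sentences):
        if i == focus_index:
            # The span the model must focus on is enclosed in <h1> tags.
            parts.append(f"<h1> {sentence} <h1>")
        else:
            parts.append(sentence)
    context = " ".join(parts)
    # Every answer is followed by a <sep> token, as in the caption.
    target = " ".join(f"{answer} <sep>" for answer in answers)
    # Input and target form a single sequence because BLOOMZ is decoder-only.
    return (f'<s> Given the following context "{context}", '
            f'<sep> extract answers: "{target}" </s>')


example = build_bloomz_ae_example(
    ["Javier is 10 years old and his brother is 15.",
     "Javier has curly hair."],
    focus_index=0,
    answers=["10 years old", "15"],
)
print(example)
```

Keeping the prompt and the target in one string reflects the caption, where the same text appears both as the figure's main body and as the model input.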

Figure 3. Format for the mT0 model for the Question Generation task. In this case, mT0 has a similar format to BLOOMZ. Nonetheless, as mT0 is an encoder-decoder model, the target is separated from the input. The translation into English is the following: Raw Input: "context": "Javier is 10 years old and his brother is 15. Javier has curly hair." "question": "How old is Javier?" "answers": {"text": "10 years old", "answer_start": 13}. Figure's main body: <s> Given the following context "Javier is <h1> 10 years old <h1> and his brother is 15. Javier has curly hair.", <sep> generate a question whose answer would be: "10 years old". Input: <s> Given the following context "Javier is <h1> 10 years old <h1> and his brother is 15. Javier has curly hair.", <sep> generate a question whose answer would be: "10 years old". Target: "How old is Javier?".
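A minimal sketch of how a SQuAD-style record could be turned into the input and target pair described in the Figure 3 caption is given below. The function name and its parameters are assumptions, and the English translation from the caption is used instead of the Spanish prompts. The character offset is recomputed for the English string (10), since the value 13 in the caption does not match the translated text and presumably refers to the Spanish original.

```python
def build_mt0_qg_example(context, answer_text, answer_start, question):
    """Return the (input, target) pair for answer-aware QG as in Figure 3.

    Hypothetical helper: the answer span is located through its character
    offset, as in SQuAD-style annotations, and wrapped in <h1> tags.
    """
    answer_end = answer_start + len(answer_text)
    if context[answer_start:answer_end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    highlighted = (f"{context[:answer_start]}<h1> {answer_text} <h1>"
                   f"{context[answer_end:]}")
    source = (f'<s> Given the following context "{highlighted}", '
              f'<sep> generate a question whose answer would be: '
              f'"{answer_text}".')
    # mT0 is an encoder-decoder model, so the target is kept separate.
    return source, question


source, target = build_mt0_qg_example(
    context="Javier is 10 years old and his brother is 15. "
            "Javier has curly hair.",
    answer_text="10 years old",
    answer_start=10,  # offset in the English translation (assumption)
    question="How old is Javier?",
)
print(source)
print(target)
```

The offset check guards against misaligned annotations, which can appear when a dataset is translated automatically and the original character positions no longer apply.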

Table 3. Evaluation metrics for mT5, mT0 and BLOOMZ fine-tuned for AE with the SQAC dataset. The best results for each dataset and metric are highlighted in bold.

Table 4. Evaluation metrics for mT5, mT0 and BLOOMZ fine-tuned for AE with the SQAC + SQuAD dataset. The best results for each dataset and metric are highlighted in bold.

Table 5. Evaluation metrics for mT5, mT0 and BLOOMZ fine-tuned for QG with the SQuAD dataset. The best results for each dataset and metric are highlighted in bold.

Table 6. Evaluation metrics for mT5, mT0 and BLOOMZ fine-tuned for QG with the SQAC dataset. The best results for each dataset and metric are highlighted in bold.

Table 7. Evaluation metrics for mT5, mT0 and BLOOMZ fine-tuned for QG with the SQAC + SQuAD dataset. The best results for each dataset and metric are highlighted in bold.

Table 8. Evaluation metrics for mT0 and BLOOMZ base models for the AE task. The best results for each dataset and metric are highlighted in bold. obtains better results for SARI while mT0 obtains the best cosine similarity and WER values for the dataset SQuAD.

Table 9. Evaluation metrics for mT0 and BLOOMZ base models for the QG task. The best results for each dataset and metric are highlighted in bold.